ABSTRACT

Many classification tasks involve linked nodes, such as people connected by
friendship links. For such networks, accuracy might be increased by including,
for each node, the (a) labels or (b) attributes of neighboring nodes as model
features. Recent work has focused on option (a), because early work showed it
was more accurate and because option (b) fit poorly with discriminative
classifiers. We show, however, that when the network is sparsely labeled,
“relational classification” based on neighbor attributes often has higher
accuracy than “collective classification” based on neighbor labels. Moreover,
we introduce an efficient method that enables discriminative classifiers to be
used with neighbor attributes, yielding further accuracy gains. We show that
these effects are consistent across a range of datasets, learning choices, and
inference algorithms, and that using both neighbor attributes and labels often
produces the best accuracy.
1.  INTRODUCTION

Many problems in communications, social networks, biology, business, etc.
involve classifying nodes in a graph. For instance, consider predicting a class
label for each page (node) in a set of linked webpages, where some node labels
are provided for learning. A traditional method would use the attributes of
each page (e.g., words in the page) to predict its label. In contrast,
link-based classification [2, 13] also uses, for each node, the attributes or
labels of neighboring pages as model features. If “neighbor labels” are used,
then an iterative algorithm for collective inference is needed, since many
labels are initially unknown [3]. If, on the other hand, “neighbor attributes”
are used, then a single step of relational inference suffices, since all
attribute values are known.

This paper is authored by an employee(s) of the United States Government and is in the
public domain. Non-exclusive copying or redistribution is allowed, provided that the
article citation is given and the authors and agency are clearly identified as its source.
CIKM’13, Oct. 27-Nov. 1, 2013, San Francisco, CA, USA.
2013 ACM 978-1-4503-2263-8/13/10
http://dx.doi.org/10.1145/2505515.2505628

David W. Aha
Navy Center for Applied Research in AI
Naval Research Laboratory, Code 5514
Washington, DC U.S.A.
david.aha@nrl.navy.mil

Despite the additional complexity of inference, recent work has used collective
inference (CI) much more frequently than relational inference (RI) for two
reasons. First, multiple algorithms for CI (e.g., belief propagation, Gibbs
sampling, ICA) can substantially increase classification accuracy [8, 10]. In
contrast, comparisons found RI to be inferior to CI [3] and to sometimes even
decrease accuracy compared to methods that ignore links [2]. Second, although
RI does not require multiple inference steps, using neighbor attributes as
model features is more complex than with neighbor labels, due to the interplay
between the larger number of attributes (vs. one label) and a varying number of
neighbors for each node. In particular, RI does not naturally mesh with
popular, discriminative classifiers such as logistic regression.

Most work on link-based classification assumes a fully-labeled training graph.
However, often (e.g., for social and webpage networks) collecting the node
attributes and link structure for this graph may be easy, but acquiring the
desired labels can be much more expensive [11, 6]. In response, recent studies
have examined CI methods with partially-labeled training graphs, using some
form of semi-supervised learning (SSL) to leverage the unlabeled portion of the
graph [15, 1, 6]. However, because using neighbor attributes seemed difficult
and unnecessary, none evaluated RI.

Our contributions are as follows. First, we provide the first evaluation of
link-based classification that compares models based on neighbor labels (CI)
vs. models based on neighbor attributes (RI), for sparsely-labeled networks.
Unlike prior studies with fully-labeled training networks, we find that RI is
often significantly more accurate than CI. Second, we introduce an efficient
technique, Multi-Neighbor Attribute Classification (MNAC), that enables
discriminative classifiers like logistic regression to be used with neighbor
attributes, further increasing accuracy. Finally, we show that the advantages
of RI with MNAC remain with a variety of datasets, learning algorithms, and
inference algorithms. We find that, surprisingly, RI’s gains remain even for
data conditions where CI was thought to be clearly preferable [3].

2.  LINK-BASED  CLASSIFICATION 

Assume we are given a graph $G = (V, E, X, Y, C)$ where $V$ is a set of nodes,
$E$ is a set of edges (links), each $x_i \in X$ is an attribute vector for a
node $v_i \in V$, each $Y_i \in Y$ is a label variable for $v_i$, and $C$ is
the set of possible labels. We are also given a set of “known” values $Y^K$ for
nodes $V^K \subset V$, so that



Report date: Nov 2013
Dates covered: 00-00-2013 to 00-00-2013
Title and subtitle: Labels or Attributes? Rethinking the Neighbors for Collective
Classification in Sparsely-Labeled Networks
Performing organization: Naval Research Laboratory, Center for Applied Research in
Artificial Intelligence (Code 5514), 4555 Overlook Ave., SW, Washington, DC, 20375
Distribution/availability statement: Approved for public release; distribution unlimited
Supplementary notes: In International Conference on Information and Knowledge
Management, San Francisco, CA, 27 Oct - 1 Nov 2013.


Security classification (report, abstract, this page): unclassified
Limitation of abstract: Same as Report (SAR)
Number of pages: 6
Standard Form 298 (Rev. 8-98), prescribed by ANSI Std Z39-18


Table 1: Types of models, based on the kinds of features used. Some notation is
adapted from Jensen et al. [3].

Model          Self attr.   Neigh. attr.   Neigh. labels
RCI            ✓            ✓              ✓
RI             ✓            ✓              -
CI             ✓            -              ✓
SelfAttrs      ✓            -              -
NeighAttrs     -            ✓              -
NeighLabels    -            -              ✓
$Y^K = \{y_i \mid v_i \in V^K\}$. Then the within-network classification task
is to infer $Y^U$, the values of $Y_i$ for the remaining nodes $V^U$ with
“unknown” values ($V^U = V \setminus V^K$).

For  example,  given  a  (partially-labeled)  set  of  interlinked 
university  webpages,  consider  the  task  of  predicting  whether 
each  page  belongs  to  a  professor  or  a  student.  There  are 
three  kinds  of  features  typically  used  for  this  task: 

•  Self attributes: features based on the textual content of each page (node),
e.g., the presence or absence of the word “teaching” for node v.

•  Neighbor  attributes:  features  based  on  the  attributes 
of  pages  that  link  to  v.  These  may  be  useful  because, 
e.g.,  pages  often  link  to  others  with  the  same  label. 

•  Neighbor  labels:  features  based  on  the  labels  of 
pages  that  link  to  v,  such  as  “Count  the  number  of 
v’s  neighbors  with  label  Student.” 
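
As a rough illustration of the three feature types listed above, the sketch below builds each of them for one node of a toy graph. The graph, the attribute name, and the helper functions are hypothetical and chosen only for illustration; links are treated as undirected for simplicity.

```python
from collections import Counter

# Hypothetical toy data: adjacency lists, binary word attributes, partial labels.
links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
attrs = {"a": {"teaching": 1}, "b": {"teaching": 0}, "c": {"teaching": 1}}
labels = {"b": "Student", "c": "Professor"}  # the label of "a" is unknown


def self_attributes(v):
    """Features taken from the node's own content."""
    return attrs[v]


def neighbor_attributes(v):
    """One attribute vector per neighbor; the list length varies by node."""
    return [attrs[u] for u in links[v]]


def neighbor_label_counts(v):
    """Aggregate the known neighbor labels into one count per label."""
    return Counter(labels[u] for u in links[v] if u in labels)
```

Note that `neighbor_attributes` returns a variable-length list, whereas `neighbor_label_counts` always reduces to one value per label; this difference is what makes neighbor attributes harder to feed into a fixed-length feature vector.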

Table 1 characterizes classification models based on the kinds of features they
use. The simplest, baseline models use only one kind. First, SelfAttrs uses
only self attributes. Second, NeighAttrs classifies a node v using only v’s
neighbors’ attributes. This model has not previously been studied, but we use
it to help measure the value of neighbor attributes on their own. Finally,
NeighLabels uses only neighbor labels. For instance, it might repeatedly
average the predicted label distributions of a node’s neighbors; this performs
surprisingly well for some datasets [5].
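
The “repeatedly average the neighbors’ label distributions” idea can be sketched as below. This is only a minimal illustration under simple assumptions (unweighted links, a uniform starting distribution for unlabeled nodes, a fixed number of rounds); it is not claimed to match the exact baseline of [5], and the function and parameter names are hypothetical.

```python
def propagate_labels(neighbors, known, labels, iterations=10):
    """Repeatedly set each unlabeled node's distribution to its neighbors' mean.

    neighbors: dict node -> list of adjacent nodes.
    known: dict node -> label for the labeled nodes (kept fixed).
    labels: list of all possible labels.
    """
    uniform = {c: 1.0 / len(labels) for c in labels}
    dist = {}
    for v in neighbors:
        if v in known:
            dist[v] = {c: float(c == known[v]) for c in labels}
        else:
            dist[v] = dict(uniform)
    for _ in range(iterations):
        updated = {}
        for v in neighbors:
            if v in known or not neighbors[v]:
                updated[v] = dist[v]  # labeled or isolated nodes do not change
                continue
            updated[v] = {
                c: sum(dist[u][c] for u in neighbors[v]) / len(neighbors[v])
                for c in labels
            }
        dist = updated
    return dist
```

No classifier is learned here, which is why such a method needs no training labels beyond the known nodes themselves.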

Other models combine self attributes with other features. If a model also uses
neighbor attributes, then it is performing “relational inference” and we call
it RI. A CI model uses neighbor labels instead, via features like the “count
Students” described above. However, this is challenging, because some labels
are unknown and must be estimated, typically with an iterative process of
collective inference (i.e., CI) [3]. CI methods include Gibbs sampling, belief
propagation, and ICA (Iterative Classification Algorithm) [10].

We focus on ICA, a simple, popular, and effective algorithm [10, 1, 6]. ICA
first predicts a label for every node in $V^U$ using only self attributes. It
then constructs additional relational features $X_R$ using the known and
predicted node labels ($Y^K$ and $Y^U$), and re-predicts labels for $V^U$ using
both self attributes and $X_R$. This process of feature computation and
prediction is repeated, e.g., until convergence.
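
The ICA loop described above can be sketched as follows. This is a minimal sketch, not the implementation evaluated here: the two prediction callables, the in-place update order, and the early-stopping test are assumptions made for illustration.

```python
def ica(nodes, neighbors, known, predict_self, predict_rel, iterations=10):
    """Iterative classification sketch.

    known: dict node -> label for the labeled nodes.
    predict_self(v) -> label, using self attributes only.
    predict_rel(v, neighbor_labels) -> label, using self attributes plus
        relational features computed from the current neighbor labels.
    """
    unknown = [v for v in nodes if v not in known]
    current = dict(known)
    # Bootstrap: predict every unknown node from its self attributes alone.
    for v in unknown:
        current[v] = predict_self(v)
    for _ in range(iterations):
        previous = dict(current)
        for v in unknown:
            neighbor_labels = [current[u] for u in neighbors[v]]
            current[v] = predict_rel(v, neighbor_labels)
        if current == previous:  # stop early once predictions are stable
            break
    return {v: current[v] for v in unknown}
```

Here `predict_rel` stands in for both the construction of the relational features and the classifier that consumes them; a real implementation would typically aggregate `neighbor_labels` into counts or proportions first.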

Finally,  RCI  uses  all  three  kinds  of  features.  Because  it 
uses  neighbor  labels,  it  also  must  use  some  kind  of  collective 
inference  such  as  ICA. 

2.1  Prior  Work  with  Neighbor  Attributes 

Some early work on link-based classification evaluated models that included
neighbor attributes [2, 13]. However, recent work has used such models
(including RI and RCI) very rarely for two primary reasons. First, prior work
found that, while using neighbor labels can increase accuracy, using neighbor
attributes can actually decrease accuracy [2].


Figure 1: Accuracy for “Gene” using Naive Bayes and SSL-Once (see Section 5).
Within a column, a circle or triangle that is filled in indicates a result that
was significantly different (better or worse) than that of CI.

Later, Jensen et al. [3] compared CI vs. RI and RCI. They found that CI
somewhat outperformed RCI and performed much better than RI. They describe how
neighbor attributes greatly increased the parameter space, leading to lower
accuracy, especially for small training graphs.

Second, it is unclear how to include neighbor attributes in popular
classifiers. In particular, nodes usually have a varying number of neighbors.
Thus, with neighbor attributes as features, there is no direct way to represent
a node in a fixed-sized feature vector as expected by classifiers such as
logistic regression or SVMs. With neighbor labels, CI algorithms address this
issue with aggregation functions (such as “Count”) that summarize all
neighboring labels into a few feature values. This works well for labels, which
are discrete and highly informative, but is more challenging for attributes,
which are more numerous, may be continuous, and are individually less
informative than a node’s label (thus, this approach fared very poorly in early
work [2]).

These two factors have produced a prevailing wisdom that CI based on neighbor
labels is better than RI based on neighbor attributes (cf., [10, 9]). This
conclusion rested on studies with fully-labeled training graphs, but has been
carried into the important domain [11, 6] of sparsely-labeled graphs. In
particular, Table 2 summarizes the models used by the most relevant prior work
with such graphs. Only one study [15] used models with neighbor attributes
(e.g., with RI or RCI), and it did not evaluate whether they were helpful (see
footnote 1).

We next show that this prevailing wisdom was partly correct, but that, for
sparsely-labeled networks, neighbor attributes are often more useful than
previously thought.

3.  THE  IMPACT  OF  LABEL  SPARSITY 

To aid our discussion, we first preview our results. Figure 1 plots the average
accuracy of RCI, RI, and CI for the Gene dataset (also used by Jensen et
al. [3]). The x-axis varies the label density (e.g., fraction of nodes with
known labels). When label density is high (> 40%), CI significantly outperforms
RI, and RCI as well when density is very high. High density is similar to
learning with a fully-labeled graph, and these results are consistent with
those of Jensen et al.

However, when density is low (< 20%), CI is significantly worse than RI or RCI.
To our knowledge, no prior work has reported this effect. Why does it occur?

1  Prior work with neighbor attributes, including that of Xiang & Neville [15],
used methods like decision trees or Naive Bayes that do not require a
fixed-length feature vector.


Table 2: Related work on link-based classification that has used some variant
of semi-supervised ICA (with partially-labeled networks). The first row is an
exception; it used fully-labeled training networks but is included for
comparison.

First, during inference, many neighbor labels are unknown. Thus, a potential
disadvantage of CI vs. RI is that some predicted labels used by CI will be
incorrect, while the neighbor attributes used by RI are all known. However,
prior work shows that CI, when learning uses a fully-labeled graph, can be
effective even when all labels in a separate test graph are initially unknown
[8, 7]. Thus, having a large number of unknown labels during CI’s inference,
while a drawback, is not enough to explain the substantial differences vs. RI.

The key problem is that, at low label density, CI struggles to learn the
parameters related to label-based features. Labels can be used for learning
such features only where both nodes of a link have known labels. In contrast,
with neighbor attributes a single node’s label makes a link useful for
learning, since the node’s neighbors’ attributes are all known. Thus, with RI a
single labeled node can provide multiple examples for learning (one for each of
its links).

For example, for the Gene dataset of Figure 1, when the density is 10% (110
known nodes), CI can learn from an average of only 20 links, while RI can use
an average of 340. Thus, for sparsely-labeled networks, the effective training
size for features based on neighbor attributes is much larger than for those
based on neighbor labels. This compensates for the greater number of parameters
required by neighbor attributes, leading to higher accuracy.
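
The gap in effective training size can be checked with a short calculation. The sketch below counts usable examples on an explicit edge list and also gives a back-of-the-envelope expectation, assuming labeled nodes are drawn uniformly at random without replacement and degrees are unrelated to labeling; both function names are hypothetical.

```python
def usable_examples(edges, labeled):
    """Count training examples for label-based vs. attribute-based features.

    edges: iterable of (u, v) undirected links.
    labeled: set of nodes with known labels.
    """
    # Neighbor-label features need a known label at both ends of a link.
    both_labeled = sum(1 for u, v in edges if u in labeled and v in labeled)
    # Neighbor-attribute features need only one labeled end; a link with two
    # labeled ends yields one example for each of them.
    node_neighbor_pairs = sum((u in labeled) + (v in labeled) for u, v in edges)
    return both_labeled, node_neighbor_pairs


def expected_examples(n_nodes, n_links, n_labeled):
    """Expected counts when the labeled nodes are a uniform random subset."""
    p_both = n_labeled * (n_labeled - 1) / (n_nodes * (n_nodes - 1))
    expected_both = n_links * p_both
    expected_pairs = n_labeled * (2 * n_links / n_nodes)  # labeled x mean degree
    return expected_both, expected_pairs
```

With the Gene totals from Table 3 (1103 nodes, 1672 links) and 110 labeled nodes, `expected_examples` gives roughly 16 and 333, the same order as the averages of 20 and 340 quoted above.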

Recent work with partially-labeled networks has also observed these problems
with neighbor labels, but did not consider neighbor attributes. Instead, these
studies avoid the problem by discarding methods based on label features and
using latent features and/or links [12, 11]. Others have proposed non-learning
methods that use label propagation or random walks [5]. Yet others have
proposed SSL variants to first predict (noisy) labels for the entire network
[15, 1, 6].

We later compare against some of these methods (e.g., with [5]). However,
Figure 1’s results already include some effective SSL methods [1, 6] (see
Section 5.3), and yet substantial problems remain for CI at low density.
Section 6.2 compares RCI, RI, and CI more completely and shows that neighbor
attributes are helpful with or without SSL.


4.  LEVERAGING  NEIGHBOR  ATTRIBUTES 

This  section  explains  existing  methods  for  predicting  with 
neighbor  attributes,  then  introduces  a  new  technique  that 
enables  the  use  of  discriminative  classifiers. 

4.1  Existing  Methods  for  Neighbor  Attributes 

Let $\mathcal{N}_i$ be the set of nodes that are adjacent to node $v_i$, i.e.,
in the “neighborhood” of $v_i$ (for simplicity, we assume undirected links
here). Furthermore, let $X_{\mathcal{N}_i}$ be the set of attribute vectors for
all nodes in $\mathcal{N}_i$
($X_{\mathcal{N}_i} = \{x_j \mid v_j \in \mathcal{N}_i\}$).

Suppose we wish to predict the label $y_i$ for $v_i$ based on its attributes
and the attributes of $\mathcal{N}_i$. As described above, the variable size of
$\mathcal{N}_i$ presents a challenge. To address this general issue, prior
studies (with neighbor labels) often assume that the labels of nodes in
$\mathcal{N}_i$ are conditionally independent given $y_i$. This assumption is
not necessarily true, but often works well in practice [3, 8, 7]. In our
context (with neighbor attributes), we can make the analogous assumption that
the attribute vectors of the nodes in $\mathcal{N}_i$ (and the attribute vector
$x_i$ of $v_i$ itself) are conditionally independent given $y_i$. Using Bayes
rule and this assumption yields


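\begin{align*}
p(y_i \mid x_i, X_{\mathcal{N}_i})
  &= p(y_i) \, \frac{p(x_i, X_{\mathcal{N}_i} \mid y_i)}{p(x_i, X_{\mathcal{N}_i})} \\
  &= p(y_i) \, \frac{p(x_i \mid y_i)}{p(x_i, X_{\mathcal{N}_i})}
     \prod_{v_j \in \mathcal{N}_i} p(x_j \mid y_i) \\
  &\propto p(y_i) \, p(x_i \mid y_i) \prod_{v_j \in \mathcal{N}_i} p(x_j \mid y_i) \tag{1}
\end{align*}

where the last step drops all values independent of $y_i$.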

To use Equation 1, we must compute $p(x_i \mid y_i)$ and $p(x_j \mid y_i)$.
The same technique works for both; we now explain for the latter. We further
assume that all attribute values for $v_j$ (e.g., the values inside $x_j$) are
independent given $y_i$. If nodes have $M$ attributes, then

\begin{align*}
p(x_j \mid y_i) &= p(x_{j1}, x_{j2}, \ldots, x_{jM} \mid y_i) \\
                &= \prod_{k=1}^{M} p(x_{jk} \mid y_i). \tag{2}
\end{align*}

Plugging this equation (and the equivalent one for $p(x_i \mid y_i)$) into
Equation 1 yields a (relational) Naive Bayes classifier [8]. In particular, the
features used to predict the label for $v_i$ are $v_i$’s attributes and
$v_i$’s neighbors’ attributes, and these values are assumed to be conditionally
independent. Jensen et al. [3] used this simple generative classifier for RI,
and a simple extension to add the labels of neighboring nodes as features
yields the equations needed for RCI.
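
A minimal sketch of scoring with Equations 1 and 2 is given below, working in log space to avoid underflow when a node has many neighbors. The layout of the probability tables (`prior`, `cond_self`, `cond_neigh`) is an assumption made for illustration, and all table entries are assumed to be strictly positive (e.g., after smoothing).

```python
import math


def log_attr_likelihood(x, cond, y):
    """log p(x | y) under Equation 2: attributes independent given the label.

    cond[y][k][value] holds p(x_k = value | y) for attribute index k.
    """
    return sum(math.log(cond[y][k][value]) for k, value in enumerate(x))


def relational_nb(x_i, neighbor_xs, prior, cond_self, cond_neigh):
    """Normalized posterior over labels following Equation 1."""
    log_scores = {}
    for y in prior:
        score = math.log(prior[y]) + log_attr_likelihood(x_i, cond_self, y)
        for x_j in neighbor_xs:  # one factor per neighbor, however many exist
            score += log_attr_likelihood(x_j, cond_neigh, y)
        log_scores[y] = score
    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {y: math.exp(s - top) / total for y, s in log_scores.items()}
```

Because each neighbor contributes one multiplicative factor, the varying neighborhood size requires no fixed-length feature vector.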

4.2  Multi-neighbor  Attribute  Classification 

The method described above can predict with neighbor attributes, and can
increase classification accuracy, as we show later. However, it has two
potential shortcomings. First, it ignores dependencies among the attributes
within a single node. Second, it requires using probabilities like
$p(x_j \mid y_i)$, whereas a discriminative classifier (e.g., logistic
regression) would compute $p(y_i \mid x_j)$. Thus, the set of classifiers that
can be used is constrained, and overall accuracy may suffer.


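A new idea is to take Equation 1 and apply Bayes rule to each conditional
probability separately. This yields

\begin{align*}
p(y_i \mid x_i, X_{\mathcal{N}_i})
  &\propto p(y_i) \, \frac{p(y_i \mid x_i) \, p(x_i)}{p(y_i)}
     \prod_{v_j \in \mathcal{N}_i} \frac{p(y_i \mid x_j) \, p(x_j)}{p(y_i)} \\
  &\propto p(y_i \mid x_i) \prod_{v_j \in \mathcal{N}_i} \frac{p(y_i \mid x_j)}{p(y_i)} \tag{3}
\end{align*}

where the last step drops all values independent of $y_i$.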

We call classification based on Equation 3 Multi-Neighbor Attribute
Classification (MNAC). This approach requires two conditional models,
$p(y_i \mid x_i)$ and $p(y_i \mid x_j)$. The first is a standard
(self) attribute-only classifier, while the second is more unusual: a
classifier that predicts the label of a node based on the attributes of one of
its neighbors. Because the prediction for this latter model is based on just a
single attribute vector, any probabilistic classifier can be used, including
discriminative classifiers such as logistic regression.
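
One way to combine the two conditional models is sketched below. Applying Bayes rule to each factor of Equation 1 and dropping terms that do not depend on $y_i$ gives a score proportional to $p(y_i \mid x_i) \prod_j p(y_i \mid x_j) / p(y_i)$; the code follows that reading. The function name and input layout are hypothetical, and all probabilities are assumed to be strictly positive.

```python
import math


def mnac_posterior(self_probs, neighbor_probs, prior):
    """Combine per-vector class posteriors into one prediction.

    self_probs: dict label -> p(y | x_i) from the self-attribute classifier.
    neighbor_probs: list of dicts, one per neighbor, label -> p(y | x_j),
        each produced by the neighbor-attribute classifier.
    prior: dict label -> p(y).
    """
    log_scores = {}
    for y in prior:
        score = math.log(self_probs[y])
        for probs in neighbor_probs:
            # Each neighbor multiplies in p(y | x_j) / p(y).
            score += math.log(probs[y]) - math.log(prior[y])
        log_scores[y] = score
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {y: math.exp(s - top) / total for y, s in log_scores.items()}
```

Any probabilistic classifier can supply `self_probs` and `neighbor_probs`, since each is queried with a single fixed-length attribute vector; with no neighbors the result reduces to the self-attribute prediction.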

MNAC’s derivation is simple, but it has not been previously used for link-based
classification. McDowell & Aha [6] used a somewhat similar technique to produce
“hybrid models” that combine two classifiers, but the derivation is different
and they did not consider neighbor attributes.


5.  EXPERIMENTAL  METHOD 

5.1  Datasets  and  Features 

We  use  all  of  the  real  datasets  used  in  prior  studies  with 
semi-supervised  ICA,  and  some  synthetic  data  (see  Tables  2 
&  3).  We  removed  all  nodes  with  no  links. 

Cora (cf., [10]) is a collection of machine learning papers. Citeseer (cf.,
[10]) is a collection of research papers. Attributes represent the presence of
certain words, and links indicate citations. We mimic Bilgic et al. [1] by
ignoring link direction, and also by using the 100 top attribute features after
applying PCA to all nodes’ attributes.

Gene  (cf.,  [3])  describes  the  yeast  genome  at  the  protein 
level;  links  represent  protein  interactions.  We  mimic  Xiang 
&  Neville  [15]  and  predict  protein  localization  using  four 
attributes:  Phenotype,  Class,  Essential,  and  Chromosome. 
When  using  LR,  we  binarized  these,  yielding  54  attributes. 

We create synthetic data using Sen et al.’s graph generator [10]. Degree of
homophily is how likely a node is to link to another node with the same label;
we use their defaults of 0.7 with a link density of 0.2. Each node has ten
binary attributes with an attribute predictiveness of 0.6 (see [7]).

5.2  Classifiers  and  Regularization 

All models shown in Table 1 (except NeighLabels) require learning a classifier
to predict the label based on self attributes and/or a classifier to predict
based on neighbor attributes. We evaluate Naive Bayes (NB), because of its past
use with neighbor attributes [3], and logistic regression (LR), because it
usually outperformed NB [10, 1]. For neighbor attributes, LR uses the new MNAC
method.

RCI and CI also require a classifier to predict based on neighbor labels.
McDowell & Aha [6] found that NB with “multiset” features was superior to LR
with “proportion” features as used by Bilgic et al. [1]. Thus, we use NB for
neighbor labels, and combine these results with the NB or LR classifiers used
for attributes (described above), using the “hybrid model” method mentioned in
Section 4.2 [6].


Table 3: Data sets summary.

Characteristics     Cora    CiteSeer   Gene    Syn.
Total nodes         2708    3312       1103    1000
Total links         5278    4536       1672    1250
Class labels        7       6          2       5
% dominant class    16%     21%        56%     20%


For sparsely-labeled data, regularization can have a large impact on accuracy.
We used five-fold cross-validation on the labeled data, selecting the value of
the regularization hyperparameter that maximized accuracy on the held-out
labeled data. We used a Gaussian prior with all LR’s features, a Dirichlet
prior with NB’s discrete features, and a Normal-Gamma prior with NB’s
continuous features.

5.3  Learning  and  Collective  Inference 

We  chose  default  learning  and  inference  algorithms  that 
performed  well  in  prior  studies;  Section  6.2  considers  others. 

For learning, we use the SSL-Once strategy: first, learn the classifiers using
the attributes and the known labels. Next, run inference to predict labels for
every “unknown” node. Finally, learn new classifiers, using the attributes,
known labels, and newly predicted labels. With LR, we also use “label
regularization” [6] which biases the learning towards models that yield
sensible label distributions (it does not apply to NB). McDowell & Aha [6]
found that these choices performed well overall and had consistent accuracy
gains compared to not using SSL. The three baseline models (see Table 1) do not
use SSL or label regularization.
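
The three steps of SSL-Once can be sketched as below. The `learn` and `infer` callables are hypothetical placeholders for whichever classifier and inference procedure are in use; label regularization is not shown.

```python
def ssl_once(known, unknown, learn, infer):
    """SSL-Once sketch: learn, predict the unknown labels, then relearn.

    known: dict node -> label.
    unknown: iterable of unlabeled nodes.
    learn(labels) -> model trained on the given node -> label mapping.
    infer(model, nodes) -> dict node -> predicted label.
    """
    model = learn(known)                # step 1: known labels only
    predicted = infer(model, unknown)   # step 2: label every unknown node
    combined = {**predicted, **known}   # known labels take precedence
    return learn(combined)              # step 3: relearn on all labels
```

Repeating steps 2 and 3 several times instead of once would give an EM-style variant of the same idea.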

For inference, CI and RCI use 10 iterations of ICA. For NeighLabels, we use a
common baseline based on relaxation labeling (wvRN+RL [5], see Section 2).

6.  RESULTS 

We report accuracy averaged over 20 trials. For each, we randomly select some
fraction of $V$ (the “label density” d) to be “known” nodes $V^K$; those
remaining form the “unknown label” test set $V^U$. We focus on the important
sparsely-labeled case [11, 6], e.g., d < 10%. To assess results, we use paired
t-tests with a 5% significance level, with appropriate corrections because the
20 test sets are not disjoint [14].
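
The trial procedure can be sketched as below. This is a minimal illustration that assumes a uniform random choice of known nodes and a caller-supplied `evaluate` function returning test-set accuracy; the significance testing is not shown, and the function names are hypothetical.

```python
import random


def split_trial(nodes, density, rng):
    """Mark a random fraction `density` of nodes as known; the rest are tested."""
    nodes = list(nodes)
    n_known = round(density * len(nodes))
    known = set(rng.sample(nodes, n_known))
    unknown = [v for v in nodes if v not in known]
    return known, unknown


def average_accuracy(nodes, density, evaluate, trials=20, seed=0):
    """Average evaluate(known, unknown) over independent random splits."""
    rng = random.Random(seed)
    scores = [evaluate(*split_trial(nodes, density, rng)) for _ in range(trials)]
    return sum(scores) / len(scores)
```

Because every trial draws from the same node set, the test sets overlap across trials, which is why a correction to the paired t-test is needed.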

6.1  Key  Results  with  Real  Datasets 

Figure 2 shows average classification accuracy for the real datasets. The solid
lines show results with RCI, RI, or CI, using NB as the attribute classifier.
The results for Gene were already discussed in Section 3, and match closely the
trends for Cora and Citeseer. In particular, for high label density d, CI (and
sometimes RCI) performs best; but, when labels are sparse, using neighbor
attributes (with RCI or RI) is essential. When d < 10%, the gains of RCI and RI
vs. CI are, with one exception, all significant (see caption). Results without
SSL-Once (not shown) showed very similar trends.

With LR (see Table 4 for partial details), the trends are very similar: CI is
significantly better than RI for high d, but the opposite holds for low d.
Fortunately, RCI (using neighbor labels and attributes) yields good relative
performance regardless of d, as was true with NB.

For comparison, the dashed lines in Figure 2 show results for RCI with LR
(using MNAC). For Cora and Citeseer, this method beats RCI with NB in all
cases. The gains are substantial (usually 8-10%) for very sparse graphs, and
smaller but still significant for higher d. Here, LR outperforms NB


Figure 2: Accuracy vs. label density, with one panel each for Cora, Citeseer,
and Gene. Solid lines show results with NB; for these, filled (not hollow)
symbols indicate significant differences vs. CI. Dashed lines show RCI+LR for
comparison; for these, filled symbols mean significant differences vs. RCI+NB.
These results (as with all others, unless otherwise stated) use SSL-Once and,
where needed, ICA for inference.


Table 4: Average accuracy for different combinations of models, learning, and
inference algorithms, using LR. Within each column, we bold the best result and
all values that were not significantly different from that result. For d > 10%
(not shown), learning and inference choices usually affected accuracy by 1% or
less.

                      Cora                    Citeseer                Gene
Label density (d)     1%    3%    5%    10%   1%    3%    5%    10%   1%    3%    5%    10%
RCI+No-SSL            58.1  74.7  78.9  82.6  51.5  63.7  67.1  71.2  63.4  70.6  75.7  78.5
RCI+SSL-Once          71.3  80.0  81.6  83.2  59.1  67.2  69.7  71.4  67.1  73.6  77.6  79.4
RCI+SSL-EM            73.5  79.6  81.3  82.7  63.1  68.2  69.8  71.2  70.6  75.6  78.0  78.9
RCI+SSL-Once+Gibbs    69.7  79.7  81.3  83.0  59.1  65.1  67.6  69.0  67.6  74.2  78.2  79.8
RI+No-SSL             57.4  73.1  77.1  80.6  51.2  63.1  66.8  70.5  63.1  68.8  73.6  76.3
RI+SSL-Once           67.8  78.7  80.5  81.8  57.9  66.7  69.4  71.3  65.9  71.6  75.7  78.0
RI+SSL-EM             70.2  78.9  80.2  81.3  62.9  68.4  69.7  71.2  68.3  71.4  76.4  76.8
CI+No-SSL             40.9  63.1  70.6  79.5  44.3  59.5  63.4  69.6  60.5  67.8  72.2  75.7
CI+SSL-Once           57.2  75.1  78.3  81.3  52.5  64.2  67.7  69.6  60.9  66.7  70.6  74.2
CI+SSL-EM             64.6  78.2  80.2  82.0  57.4  65.8  68.0  69.6  62.8  67.0  75.3  77.2
CI+SSL-Once+Gibbs     36.6  62.8  71.2  78.0  44.3  54.8  53.2  60.8  61.0  67.0  70.4  74.8
SelfAttrs             37.1  54.1  59.5  65.8  40.7  55.8  61.8  66.7  59.8  65.5  68.4  71.6
NeighAttrs            52.1  70.9  74.6  77.8  45.1  58.4  63.5  65.8  59.4  65.9  69.5  71.8
NeighLabels           41.2  63.4  72.3  77.9  31.5  47.9  52.6  55.8  55.7  62.6  66.6  71.6


because the attributes are continuous (a challenging scenario
for NB) and because LR does not assume that the (many)
attributes are independent. For Gene, however, NB is
generally better, likely because NB can use the 4 discrete
attributes, whereas LR must use the 54 binarized attributes.

Overall, for sparsely-labeled graphs, neighbor attributes
appear to be much more useful than previously recognized,
and our new LR with MNAC sometimes significantly
increases accuracy. For simplicity, below we only consider LR.

6.2  Varying  Learning  &  Inference  Algorithms 

The results above all used SSL-Once for learning and,
where needed, ICA for collective inference. Table 4 examines
results where we vary these choices. First, there are different
learning variants: No-SSL, where SSL is not used (though
label regularization still is), and SSL-EM, where SSL-Once
is repeated 10 times, as in McDowell & Aha [6]. Second,
for RCI and CI we consider Gibbs sampling instead of ICA.
Gibbs has been frequently used, including in the RI vs. CI
comparisons of Jensen et al. [3], but sometimes behaves
erratically [7].
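The relationship between the three learning variants can be sketched as a single loop. This is only an illustrative outline: it assumes that one SSL round means "predict labels for the unlabeled nodes with the current model, then relearn from known plus predicted labels", which this section does not spell out. The names `ssl_learn`, `learn`, `infer`, and `rounds` are hypothetical, and the toy majority-class learner exists only to make the sketch runnable.

```python
from collections import Counter
from typing import Any, Callable, Dict, Hashable, Iterable

Labels = Dict[Hashable, str]

def ssl_learn(known: Labels,
              unlabeled: Iterable[Hashable],
              learn: Callable[[Labels], Any],
              infer: Callable[[Any, Iterable[Hashable]], Labels],
              rounds: int) -> Any:
    """rounds=0 mimics No-SSL, rounds=1 SSL-Once, rounds=10 SSL-EM (assumed reading)."""
    unlabeled = list(unlabeled)
    model = learn(known)                       # initial model from true labels only
    for _ in range(rounds):
        predicted = infer(model, unlabeled)    # e.g. via ICA or Gibbs sampling
        model = learn({**known, **predicted})  # relearn with predicted labels added
    return model

# Toy stand-ins (hypothetical): the "model" is just the majority class.
learn_calls = []

def toy_learn(labels: Labels) -> str:
    learn_calls.append(len(labels))
    return Counter(labels.values()).most_common(1)[0][0]

def toy_infer(model: str, nodes: Iterable[Hashable]) -> Labels:
    return {n: model for n in nodes}

toy_model = ssl_learn({"a": "x", "b": "x", "c": "y"}, ["d", "e"],
                      toy_learn, toy_infer, rounds=10)
```

With `rounds=10` the learner is invoked eleven times: once on the three known labels and ten more times on all five nodes.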

These choices can impact accuracy. For instance,
SSL-Once+Gibbs sometimes beats SSL-Once with ICA, but
always by less than 1%, and decreases accuracy substantially
in several cases with CI. Using SSL-EM instead of
SSL-Once leads to more consistent gains that are sometimes
substantial, especially for CI.
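The Gibbs vs. ICA comparison can be checked directly against the Table 4 rows. The short script below copies the four relevant rows (values ordered Cora, Citeseer, Gene, each at d = 1%, 3%, 5%, 10%) and reports the largest gain and largest loss from switching ICA to Gibbs; the dictionary and function names are illustrative only.

```python
# Table 4 accuracies (%), ordered Cora, Citeseer, Gene at d = 1%, 3%, 5%, 10%.
TABLE4 = {
    "RCI+SSL-Once":       [71.3, 80.0, 81.6, 83.2, 59.1, 67.2, 69.7, 71.4,
                           67.1, 73.6, 77.6, 79.4],
    "RCI+SSL-Once+Gibbs": [69.7, 79.7, 81.3, 83.0, 59.1, 65.1, 67.6, 69.0,
                           67.6, 74.2, 78.2, 79.8],
    "CI+SSL-Once":        [57.2, 75.1, 78.3, 81.3, 52.5, 64.2, 67.7, 69.6,
                           60.9, 66.7, 70.6, 74.2],
    "CI+SSL-Once+Gibbs":  [36.6, 62.8, 71.2, 78.0, 44.3, 54.8, 53.2, 60.8,
                           61.0, 67.0, 70.4, 74.8],
}

def gibbs_minus_ica(model: str) -> list:
    """Per-column accuracy change when Gibbs replaces ICA (positive = Gibbs better)."""
    ica = TABLE4[f"{model}+SSL-Once"]
    gibbs = TABLE4[f"{model}+SSL-Once+Gibbs"]
    return [round(g - i, 1) for g, i in zip(gibbs, ica)]

for model in ("RCI", "CI"):
    diffs = gibbs_minus_ica(model)
    print(model, "largest gain:", max(diffs), "largest loss:", min(diffs))
```

For both models the largest gain is 0.6 points, while the largest loss is 2.4 points for RCI and 20.6 points for CI (Cora at d = 1%).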

Thus, for different datasets and classifiers, the best SSL
and inference methods will vary. However, our default use
of ICA with SSL-Once was rarely far from maximal when
RCI, the best model, was used. Moreover, the values in bold
show that, regardless of the learning and inference choices,
RCI or RI yielded the best overall accuracy (at least for the
data of Table 4, which focuses on sparse networks). Also,
using RCI or RI with SSL-Once almost always outperformed
CI with any learning/inference combination shown.

For comparison, the bottom of Table 4 also shows results
with three baseline algorithms. NeighAttrs performs best
on average, indicating the predictive value of neighbor
attributes, even when used alone.
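The "best on average" claim can be reproduced from the three baseline rows of Table 4. The snippet below averages each row over all twelve columns (three datasets, four label densities); weighting every column equally is an assumption about what "on average" means here, and the variable names are illustrative.

```python
from statistics import mean

# Baseline rows of Table 4 (%), ordered Cora, Citeseer, Gene at d = 1%, 3%, 5%, 10%.
BASELINES = {
    "SelfAttrs":   [37.1, 54.1, 59.5, 65.8, 40.7, 55.8, 61.8, 66.7,
                    59.8, 65.5, 68.4, 71.6],
    "NeighAttrs":  [52.1, 70.9, 74.6, 77.8, 45.1, 58.4, 63.5, 65.8,
                    59.4, 65.9, 69.5, 71.8],
    "NeighLabels": [41.2, 63.4, 72.3, 77.9, 31.5, 47.9, 52.6, 55.8,
                    55.7, 62.6, 66.6, 71.6],
}

# Unweighted mean over all twelve dataset/density columns.
averages = {name: round(mean(vals), 1) for name, vals in BASELINES.items()}
best = max(averages, key=averages.get)
print(averages, "best:", best)
```

Under this equal weighting, NeighAttrs averages about 64.6%, compared with about 58.9% for SelfAttrs and 58.3% for NeighLabels.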

6.3  Varying  the  Data  Characteristics 

We compared RCI, RI, and CI on the synthetic data.
First, varying the label density produced results (not shown)
very similar to Gene in Figure 2. Second, in Figure 3(a), as
homophily increases, both RI and CI improve by exploiting
greater link-based correlations. RCI does even better by
using both kinds of link features, except at low homophily
where using the (uninformative) links decreases accuracy.

Figure 3(b) shows the impact of adding some random
attributes for each node. We expected CI's relative
performance to improve as these were added, based on the results
and argument of Jensen et al. [3]: CI has fewer parameters
than RI, and thus should suffer less from high variance
due to the random attributes. Instead, we found that RI's
(and RCI's) gains over CI only increase, for two reasons.


Figure 3: Synthetic data results, using LR and 5% label
density: (a) varying homophily; (b) adding random attributes.
Filled symbols indicate sig. differences vs. CI.


First, Jensen et al. had fully-labeled training data; for our
sparse setting, RI has a larger effective training size (see
Section 3), reducing variance. Second, unlike Jensen et al., we
use cross-validation to select regularization parameters for
the features. The regularization reduces variance for RI and
CI, and is especially helpful for RI as random attributes are
added. If we remove regularization and increase label
density, the differences between CI and RI decrease markedly.

6.4  Additional  Experiments 

Due to lack of space, we can only summarize two additional
experiments. First, we evaluated "soft ICA", where
continuous label estimates are retained and used at each
step of ICA instead of choosing the most likely label for
each node [15]. This slightly increased accuracy in some
cases, but did not change the relative performance of RCI
and RI vs. CI. Second, we compared our results (using a
single partially-labeled graph) vs. "ideal" results obtained
using a fully-labeled learning graph or a fully-labeled
inference graph. We found that ideal learning influenced results
much more than ideal inference for all methods, but that
CI was (negatively) affected by realistic (non-ideal) learning
much more than RI or RCI were. Thus, as argued in
Section 3, RI's and RCI's gains vs. CI for sparsely-labeled
networks seem to arise more because of CI's difficulties with
learning than with inference.
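The difference between standard ("hard") ICA and soft ICA shows up in how a node's neighbor-label feature is computed from the current estimates. The sketch below is a minimal illustration, assuming the feature is the per-class proportion among a node's neighbors; that aggregation choice and the function names are assumptions, not details given in this section.

```python
from typing import List, Sequence

def hard_neighbor_feature(neighbor_probs: Sequence[Sequence[float]]) -> List[float]:
    """Each neighbor votes only for its single most likely label."""
    n_classes = len(neighbor_probs[0])
    counts = [0.0] * n_classes
    for probs in neighbor_probs:
        top = max(range(n_classes), key=lambda k: probs[k])
        counts[top] += 1.0
    return [c / len(neighbor_probs) for c in counts]

def soft_neighbor_feature(neighbor_probs: Sequence[Sequence[float]]) -> List[float]:
    """Each neighbor contributes its full estimated label distribution."""
    n_classes = len(neighbor_probs[0])
    return [sum(p[k] for p in neighbor_probs) / len(neighbor_probs)
            for k in range(n_classes)]

# Three neighbors over two classes; two of them lean only weakly toward class 0.
estimates = [[0.60, 0.40], [0.55, 0.45], [0.10, 0.90]]
hard = hard_neighbor_feature(estimates)
soft = soft_neighbor_feature(estimates)
```

In this example the hard feature favors class 0 (two of three votes), while the soft feature favors class 1, because the two weak class-0 estimates are outweighed by one confident class-1 estimate.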


7.  CONCLUSION 

Link-based classification is an important task, for which
the most common methods involve computing relational
features. Almost no recent work has considered neighbor
attributes for such features, because prior work showed they
performed poorly and they were not compatible with many
classifiers. We showed, however, that for sparsely-labeled
graphs using neighbor attributes (with RI) often
significantly outperformed neighbor labels (with CI), and that
using both (with RCI) yielded high accuracy regardless of label
density. We also introduced a new method, MNAC, which
enables classifiers like logistic regression to use neighbor
attributes, further increasing accuracy for some datasets.
Naturally, the best classifier depends upon data characteristics;
MNAC greatly expands the set of possibilities.

Our findings should encourage future researchers to
consider neighbor attributes in models for link-based
classification, to include RI and RCI as baselines in comparisons,
and to re-evaluate some earlier work that did not consider
such models (see Table 2). For instance, Bilgic et al. [1]
studied active learning with sparse graphs; their optimal
strategies could be quite different if RI or RCI were
considered, since they could tolerate learning with fewer and
more widely-dispersed labels. Likewise, Shi et al. [11] used
semi-supervised ICA with CI as a baseline, but these results
could change substantially with RCI instead.

Our  results  need  to  be  confirmed  with  additional  datasets 
and  learning  algorithms.  Also,  the  best  methods  should  be 
compared  against  others  discussed  in  Section  2.  Finally,  we 
intend  to  explore  these  effects  in  networks  that  include  nodes 
with  different  types  and  some  missing  attribute  values. 